Emergence of time-horizon invariant correlation structure in financial returns by subtraction of the market mode

We investigate the emergence of a structure in the correlation matrix of assets' returns as the time-horizon
over which returns are computed increases from the minutes to the daily scale. We analyze data from different
stock markets (New York, Paris, London, Milano) and with different methods. The results depend crucially on
whether or not the data are restricted to the "internal" dynamics of the market, i.e. on whether the "center of
mass" motion (the market mode) is removed. If the market mode is not removed, we find that the structure emerges,
as the time-horizon increases, from the splitting of a single large cluster. In NYSE we find that, when the market
mode is removed, the structure of correlations at the daily scale is already well defined at the 5 minute
time-horizon, and this structure accounts for 80% of the classification of stocks in economic sectors. Similar
results, though less sharp, are found for the other markets. We also find that the structure of correlations in
the overnight returns is markedly different from that of intraday activity.

I. INTRODUCTION

Besides their intrinsic interest, financial markets have also attracted a great deal of
attention as a paradigm of complex systems of interacting agents. In this view, the
correlations between different assets are one of the signatures of the complexity of the
system's interactions and, as such, have been the focus of intense recent research [1, 2].
The central object of study is the empirical correlation matrix of a set of $N$ assets,
whose elements are the Pearson correlation coefficients $C^{(\tau)}_{i,j}(T)$ between the
log-returns of assets $i$ and $j$ over a time-horizon $\tau$, measured on historical time
series of length $T$. Early studies have focused mainly on daily returns ($\tau = 1$ day)
and have shown that the bulk of the eigenvalue distribution of the correlation matrix is
dominated by noise and is described very well by random matrix theory [3, 4]. This band of
noisy eigenvalues shrinks as $\sqrt{N/T}$ as the length $T$ of the dataset increases, but
it is significant for typical cases where $N$ and $T$ are of the order of some hundreds.
The few large eigenvalues which leak out of the noise background contain significant
information about the market's structure. The taxonomy built with different methods from
financial correlations alone bears remarkable similarity with a classification in economic
sectors. This agrees with the expectation that companies engaged in similar economic
activities are affected by economic factors in a similar way. With respect to their
dynamical properties, it has been found that financial correlations are persistent over
time [8] and that they follow recurrent patterns [7].

Furthermore, correlations "build up" as the time-horizon $\tau$ on which returns are
measured increases, and they saturate for returns on the scale of some days [9, 10]. This
behavior, known as the Epps effect [11], is a manifestation of the process of mutual
information exchange across assets. It quantifies how this information flow is ultimately
"incorporated" into correlations, in much the same way as information on single assets is
incorporated into their prices. Interestingly, it was found that such a process is much
faster today than in the past and more pronounced for more capitalized stocks [9]. It has
also been remarked [12, 13] that the structure of correlations changes as the time-horizon
$\tau$ over which returns are defined increases, i.e. that "pictorially, the market appears
as an embryo which progressively forms and differentiates over time".

*Electronic address: marsili@ictp.it
†Also at DEMOCRITOS
‡Electronic address: micciche@lagash.dft.unipa.it


Here we shall take a closer look at the dependence of the structure of correlations on the
time-horizon $\tau$ and show that the observed evolution of the market structure is due, to
some extent, to the dynamics of the market mode. Global correlations play a dominant role
at high frequency, thus giving rise to correlation structures which are much more clustered
than at the daily scale. However, if global correlations are removed, the structure of
correlations at the daily scale is largely preserved across time-horizons, down to a scale
of 5 minutes for the most liquid market we have analyzed. Loosely speaking, the network
structure, after removing the market mode, appears fully formed and differentiated already
at small scales; it only grows in size (of correlations) as the time-horizon increases.

Disentangling the effect of the market mode when computing pairwise correlations between
stocks is analogous to decomposing the dynamics of a complex interacting system into that
of its center of mass and that of its internal coordinates. This is natural in physics,
where the center of mass dynamics is determined by external forces, whereas internal
coordinates mainly respond to inter-particle interaction forces. By analogy, our results
suggest that in order to understand the dynamics of inter-asset correlations, it makes
sense to eliminate the effect of the "center of mass".

The paper is organized as follows: in the next Section we discuss the datasets and how we
build correlation matrices. Then we shall discuss the results of the data clustering
approach in Section III, first for NYSE and then for the other markets. The following
Section deals with the Minimal Spanning Tree approach. Finally we shall summarize our
results and offer some concluding remarks.

II. THE DATA 

In this paper we empirically investigate the ensemble be- 
haviour of price returns for 4 different markets: the New York 
Stock Exchange (NYSE), the London Stock Exchange (LSE), 
the Paris Bourse (PB) and the Borsa Italiana (BI). All data 
refer to year 2002. 

The NYSE data are taken from the Trades and Quotes (TAQ) database maintained by NYSE [14].
In particular, 100 highly capitalized stocks were considered. For each stock and for each
trading day we consider the time series of stock prices recorded transaction by
transaction. Since transactions for different stocks do not happen simultaneously, we
divide each trading day (lasting 6h 30') into intervals of length $\tau$. For each trading
day, we define $N_\tau$ intraday stock price proxies $p_i(t_k)$ of asset $i$, with
$k = 1, \dots, N_\tau$. The proxy is defined as the transaction price detected nearest to
the end of the interval (this is one possible way to deal with high-frequency financial
data [15]). By using these proxies, we compute the price returns

$$a_i^{(\tau)}(t) = \ln p_i(t) - \ln p_i(t - \tau) \tag{1}$$

at time-horizon $\tau$. The time-horizons used are $\tau = 5, 15, 30, 65, 195$ minutes.
For NYSE, these values of $\tau$ are large enough that all the considered stocks have at
least one transaction in each time interval.
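
The construction of price proxies and of the returns of Eq. (1) can be illustrated with a minimal Python sketch. The function names and the toy list of trades are hypothetical, and the sketch assumes that "nearest to the end of the interval" means the last transaction recorded inside each interval; other conventions are possible.

```python
import math

def price_proxies(trades, day_start, day_end, tau):
    """One proxy price per interval of length tau (same time unit as trades).

    trades: list of (time, price) pairs sorted by time.
    Assumption: the proxy is the last trade inside each interval, i.e. the
    trade closest to the end of the interval from below.
    """
    n_tau = int(round((day_end - day_start) / tau))
    proxies = []
    for k in range(1, n_tau + 1):
        t_lo = day_start + (k - 1) * tau
        t_hi = day_start + k * tau
        inside = [price for t, price in trades if t_lo < t <= t_hi]
        if not inside:
            raise ValueError(f"no transaction in interval {k}")
        proxies.append(inside[-1])
    return proxies

def log_returns(proxies):
    """Log-returns between consecutive proxies, as in Eq. (1)."""
    return [math.log(p1) - math.log(p0) for p0, p1 in zip(proxies, proxies[1:])]

# Toy example: times in minutes, a 15 minute "day" split into 5 minute intervals.
trades = [(1.0, 100.0), (4.5, 101.0), (7.0, 102.0), (9.9, 100.0), (14.0, 103.0)]
proxies = price_proxies(trades, day_start=0.0, day_end=15.0, tau=5.0)
returns = log_returns(proxies)
```

With a trading day of 390 minutes, the time-horizons 5, 15, 30, 65 and 195 minutes all divide the day into an integer number of intervals, which is what the rounding of `n_tau` relies on.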

The LSE data are taken from the "Rebuild Order Book" database, maintained by LSE [16]. In
particular, we consider only the electronic transactions for 92 highly traded stocks
belonging to the SET1 segment of the LSE market. The trading activity has been defined in
terms of the total number of transactions (electronic and manual) that occurred in 2002.
However, most of the transactions, a mean value of 75% for the 92 stocks, are of the
electronic type. This market is commonly believed to be very active and can be regarded as
a realization of a "liquid" market. For each stock $i$ and for each trading day we consider
the time series of stock prices recorded transaction by transaction and generate $N_\tau$
intraday stock price proxies $p_i(t_k)$ according to the procedure explained above. For the
LSE data, the time-horizons used were $\tau = 5, 15, 30, 51, 102, 255$ minutes. Each
trading day lasts 8h 30'.

The PB data are taken from the "Historical Market Data" database, maintained by EURONEXT
[17]. In particular, we consider the electronic transactions of two subsets of stocks
traded in the year 2002. For each stock $i$ and for each trading day, lasting 8h 30', we
consider the time series of stock prices recorded transaction by transaction and generate
$N_\tau$ intraday stock price proxies $p_i(t_k)$ according to the procedure explained
above. A first set, which will be analyzed in Section III, consists of the 75 most
frequently traded stocks at time-horizons $\tau_k = 27 \cdot 2^k$ seconds, for
$k = 0, \dots, 10$. An analogous dataset was derived considering tick time:
$\tau_k^{(\mathrm{tick})} = 100 \cdot 2^k$ ticks. This choice was made in order to probe
the region of very high frequencies and to assess the relevance of the time inhomogeneity
of trading activity at intraday time scales. In this respect, it is worth remarking that
for small $\tau$ stocks were not traded in each time interval. A second dataset, which will
be considered in Section IV, instead consisted of $N = 39$ stocks which were continuously
traded throughout 2002 (i.e. in each time interval) over time-horizons of
$\tau = 5, 15, 30, 51, 102, 255$ minutes.

The BI data are taken from the "Dati Intraday" database, maintained by Borsa Italiana [18].
In particular, we consider only the electronic transactions that occurred for 30 stocks
continuously traded throughout the year 2002. For each stock $i$ and for each trading day
we consider the time series of stock prices recorded transaction by transaction and
generate $N_\tau$ intraday stock price proxies $p_i(t_k)$ according to the procedure
explained above. For the BI data, the time-horizons used were
$\tau = 5, 15, 30, 60, 120, 240$ minutes. Each trading day lasts 8h.

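For all markets, in addition to the intraday time-horizons, we have considered returns on
the daily time-horizon

$$a_i^{(\mathrm{op\text{-}cl})}(n) = \log p_i^{\mathrm{cl}}(n) - \log p_i^{\mathrm{op}}(n), \tag{2}$$
$$a_i^{(\mathrm{cl\text{-}cl})}(n) = \log p_i^{\mathrm{cl}}(n) - \log p_i^{\mathrm{cl}}(n-1), \tag{3}$$
$$a_i^{(\mathrm{cl\text{-}op})}(n) = \log p_i^{\mathrm{op}}(n) - \log p_i^{\mathrm{cl}}(n-1), \tag{4}$$

corresponding to intraday, daily and overnight returns, respectively. Here
$p_i^{\mathrm{op}}(n)$ and $p_i^{\mathrm{cl}}(n)$ are the opening and closing prices of
stock $i$ in day $n$.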
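
The three daily-scale returns of Eqs. (2)-(4) are related by a simple identity: the daily (close-to-close) return is the sum of the overnight and of the intraday return. A short Python sketch makes this explicit; the function name and the toy prices are hypothetical.

```python
import math

def daily_decomposition(open_prices, close_prices):
    """Intraday, daily and overnight log-returns for days n = 1, ..., len - 1."""
    intraday, daily, overnight = [], [], []
    for n in range(1, len(close_prices)):
        # Eq. (2): open to close of the same day
        intraday.append(math.log(close_prices[n]) - math.log(open_prices[n]))
        # Eq. (3): close of the previous day to close of the current day
        daily.append(math.log(close_prices[n]) - math.log(close_prices[n - 1]))
        # Eq. (4): close of the previous day to open of the current day
        overnight.append(math.log(open_prices[n]) - math.log(close_prices[n - 1]))
    return intraday, daily, overnight

# Toy opening and closing prices for four consecutive days.
opens = [100.0, 101.0, 99.5, 102.0]
closes = [100.5, 100.0, 101.0, 103.0]
intraday, daily, overnight = daily_decomposition(opens, closes)
```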

Each stock can be associated with an economic sector of activity. For the NYSE data we
considered the classification scheme given in the web-site http://finance.yahoo.com/, for
the LSE and BI data we considered the classification scheme used in the web-site
www.euroland.com, and for the PB data we considered the classification scheme used in the
web-site http://www.euronext.com/. The relevant economic sectors are reported in Table I.

Given the price return at a selected time-horizon $\tau$, we built the correlation matrices
in the usual way

$$A^{(\tau)}_{i,j} = \frac{\langle a_i^{(\tau)} a_j^{(\tau)}\rangle - \langle a_i^{(\tau)}\rangle \langle a_j^{(\tau)}\rangle}{\sqrt{\left[\langle (a_i^{(\tau)})^2\rangle - \langle a_i^{(\tau)}\rangle^2\right]\left[\langle (a_j^{(\tau)})^2\rangle - \langle a_j^{(\tau)}\rangle^2\right]}}. \tag{5}$$

Here and in what follows, $\langle \dots \rangle = (1/N_\tau) \sum_{t=1}^{N_\tau} \dots$
denotes time average.

TABLE I: Color codes for the economic sectors of activity for the stocks.

 No.  Sector           Color
  6   Healthcare       grey
  7   Basic Materials  violet
  8   Services         cyan
  9   Utilities        magenta
 10   Capital Goods    light green
 11   Transportation   maroon
 12   Conglomerates    orange


In order to disentangle different components of the dynamics and to understand their
effect, we also considered a series of datasets derived from $a_i^{(\tau)}(t)$. In each
derived dataset we subtract a particular component of market dynamics from the rest. When
the structure of a derived dataset differs substantially from that of the matrix $A$ we can
conclude that the decomposition is meaningful and informative.

First we removed the "center of mass" dynamics:

$$b_i^{(\tau)}(t) = a_i^{(\tau)}(t) - \frac{1}{N} \sum_{j=1}^{N} a_j^{(\tau)}(t). \tag{6}$$

From this, a covariance matrix $B^{(\tau)}_{i,j}$ was computed in the same way as in
Eq. (5).
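
The effect of subtracting the "center of mass" can be seen on synthetic data. The following Python sketch is only an illustration under stated assumptions: the returns are simulated as a common market factor plus a sector factor plus idiosyncratic noise (all names, sizes and amplitudes are hypothetical), the Pearson correlation matrix is computed directly from its definition, and the cross-sectional mean is then removed at each time step.

```python
import math
import random

def correlation_matrix(series):
    """Pearson correlation matrix of N equal-length return series."""
    n_obs = len(series[0])
    centered = []
    for x in series:
        m = sum(x) / n_obs
        centered.append([v - m for v in x])
    norms = [math.sqrt(sum(v * v for v in x)) for x in centered]
    n = len(series)
    return [[sum(p * q for p, q in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(n)] for i in range(n)]

def remove_center_of_mass(series):
    """Subtract the cross-sectional average return at each time step."""
    n = len(series)
    market = [sum(col) / n for col in zip(*series)]
    return [[v - m for v, m in zip(x, market)] for x in series]

def mean_off_diagonal(c):
    n = len(c)
    return sum(c[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

# Synthetic returns: 6 stocks in 2 sectors, market factor + sector factor + noise.
rng = random.Random(0)
sector = [0, 0, 0, 1, 1, 1]
a = [[] for _ in sector]
for _ in range(2000):
    market = rng.gauss(0.0, 1.0)
    factors = [rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)]
    for i, s in enumerate(sector):
        a[i].append(market + factors[s] + rng.gauss(0.0, 0.5))

A = correlation_matrix(a)      # raw correlations, dominated by the market factor
b = remove_center_of_mass(a)   # "internal" dynamics
B = correlation_matrix(b)      # sector structure stands out, mean correlation < 0
```

In this toy setting the off-diagonal elements of `A` are all large and positive, whereas in `B` pairs of stocks in the same sector remain positively correlated and pairs in different sectors become negatively correlated, so that the average off-diagonal correlation is slightly negative.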

In a further dataset we removed the effect of the market index from $a_i^{(\tau)}(t)$. This
was done by first considering the time series $I^{(\tau)}(t)$ of the corresponding market
index at the same time-horizon $\tau$ and then estimating the coefficients of a one-factor
model

$$a_i^{(\tau)}(t) = \alpha_i + \beta_i I^{(\tau)}(t) + c_i^{(\tau)}(t). \tag{7}$$

The residuals $c_i^{(\tau)}(t)$ were used to build the covariance matrix
$C^{(\tau)}_{i,j}$. We could build the time series $I^{(\tau)}(t)$ only in the case of NYSE
data, for which we had access to intraday data of the S&P500 composite index.
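
A one-factor model of the form of Eq. (7) can be fitted in several ways. The Python sketch below assumes ordinary least squares, which the text does not state explicitly, and uses a simulated index and a simulated stock; the function name and all numerical values are hypothetical.

```python
import random

def one_factor_residuals(a, index):
    """Least-squares fit of a(t) = alpha + beta * index(t) + c(t).

    Returns (alpha, beta, residuals). Ordinary least squares is an
    assumption of this sketch.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_i = sum(index) / n
    cov = sum((x - mean_a) * (y - mean_i) for x, y in zip(a, index))
    var = sum((y - mean_i) ** 2 for y in index)
    beta = cov / var
    alpha = mean_a - beta * mean_i
    residuals = [x - alpha - beta * y for x, y in zip(a, index)]
    return alpha, beta, residuals

# Simulated index returns and a stock with alpha = 0.1 and beta = 1.5.
rng = random.Random(1)
index = [rng.gauss(0.0, 1.0) for _ in range(1000)]
a = [0.1 + 1.5 * x + rng.gauss(0.0, 0.3) for x in index]
alpha, beta, c = one_factor_residuals(a, index)
```

By construction the residuals have zero time average and are uncorrelated with the index, so any correlation left between the residuals of two stocks is not explained by the index.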

In all datasets we computed an "endogenous" market index using the market average return

$$\bar a^{(\tau)}(t) = \frac{1}{N} \sum_{i=1}^{N} a_i^{(\tau)}(t).$$

Using this instead of the market index $I^{(\tau)}(t)$ in Eq. (7) and considering the
residues $d_i^{(\tau)}(t)$, we computed a further covariance matrix $D^{(\tau)}_{i,j}$.



Finally, we produced a dataset by removing the contribution of the largest eigenvector of
the matrix $A^{(\tau)}_{i,j}$. This can be done by zeroing the largest eigenvalue of $A$,
as discussed in Ref. [4]. An alternative method, which we prefer, is that of removing the
"optimal" factor $G^{(\tau)}(t)$, which is obtained by minimizing

$$\sum_{i=1}^{N} \sum_{t=1}^{N_\tau} \left[ a_i^{(\tau)}(t) - \alpha_i - \beta_i G^{(\tau)}(t) \right]^2$$

over $\alpha_i$, $\beta_i$ and $G^{(\tau)}(t)$. The residues $e_i^{(\tau)}(t)$ resulting
from this operation coincide with the time series obtained from $a_i^{(\tau)}(t)$ by
subtracting the leading contribution of its singular value decomposition. We call
$E^{(\tau)}_{i,j}$ the correlation matrix of the residues $e_i^{(\tau)}(t)$.
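
The minimization over $\alpha_i$, $\beta_i$ and $G^{(\tau)}(t)$ can be sketched in Python with alternating least-squares updates, which amount to a power iteration for the leading singular component of the row-centered data. This is one possible numerical route, not necessarily the one used for the results reported here; the function name, the starting vector and the synthetic data are assumptions of the example.

```python
import math
import random

def remove_optimal_factor(series, n_iter=200):
    """Subtract the best rank-one term beta_i * G(t) from row-centered series.

    Returns (beta, g, residuals), with g normalized to unit Euclidean norm.
    """
    n_obs = len(series[0])
    x = []
    for row in series:
        m = sum(row) / n_obs              # alpha_i absorbs the time average
        x.append([v - m for v in row])
    g = [sum(col) for col in zip(*x)]     # start from the cross-sectional sum
    for _ in range(n_iter):
        norm = math.sqrt(sum(v * v for v in g))
        g = [v / norm for v in g]
        # best beta_i for fixed G, then best G(t) direction for fixed beta
        beta = [sum(v * w for v, w in zip(row, g)) for row in x]
        g = [sum(bi * row[t] for bi, row in zip(beta, x)) for t in range(n_obs)]
    norm = math.sqrt(sum(v * v for v in g))
    g = [v / norm for v in g]
    beta = [sum(v * w for v, w in zip(row, g)) for row in x]
    residuals = [[v - bi * w for v, w in zip(row, g)] for bi, row in zip(beta, x)]
    return beta, g, residuals

# Synthetic returns: market factor + sector factor + noise (hypothetical sizes).
rng = random.Random(2)
sector = [0, 0, 0, 1, 1, 1]
a = [[] for _ in sector]
for _ in range(500):
    market = rng.gauss(0.0, 1.0)
    factors = [rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)]
    for i, s in enumerate(sector):
        a[i].append(market + factors[s] + rng.gauss(0.0, 0.5))

beta, g, e = remove_optimal_factor(a)
```

Because the rows are centered before the iteration, the factor `g` has zero time average, and each residual series is orthogonal to `g`.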

In summary, we consider the original time series (set $A$), the one obtained by subtracting
the average market return (set $B$) and those obtained from the residues of a one-factor
model with the market index (set $C$), the average market return (set $D$) and the optimal
factor (set $E$). Set $C$ represents a case where the market mode is exogenously
determined, whereas in sets $D$ and $E$ it is determined by the data itself. This allows us
to understand how much an index such as the S&P500, which is a weighted average, accounts
for the collective dynamics of the market.

The distribution of matrix elements is shown in Fig. 1 as a function of time-horizon (top,
for sets $A$ and $B$) and for different datasets at the intraday time-horizon (bottom). We
observe that the distribution spreads out as the time-horizon increases, as a manifestation
of the Epps effect. However, while the distribution of $A_{i,j}$ is centered around a
positive value, that of the correlations of the derived datasets is peaked on values close
to zero and is narrower. For set $B$ ($D$ and $E$) the peak is at slightly negative values,
whereas for set $C$ it occurs at positive values. This suggests that the removal of
correlations is more efficient when the single factor is computed from the data. It also
shows that the dynamics of the mean $\bar a(t)$ explains the correlations better than the
market index.

We also find that intraday and overnight returns have distinctly different distributions of
correlation coefficients. This difference is particularly pronounced in dataset $C$, which
again suggests that the market index is even less explanatory of the market's collective
behavior at these scales.

The correlations $D_{i,j}$ and $E_{i,j}$ were found to have a distribution which is similar
to that of $B_{i,j}$. This anticipates a generic conclusion: the subtraction of a global
component from the dynamics is most meaningful when it eliminates (either implicitly as in
$B$ or explicitly as in $E$) the market mode by setting the corresponding eigenvalue to
zero.

Before analyzing the structure of correlations, it is of interest to provide some estimate
of the relative strength of global correlations and of noise in the correlation matrices
$A$. Fig. 2 plots the share of correlation carried by the largest eigenvalue $\Lambda$
(which is $\Lambda/N$, by normalization) for NYSE, LSE and PB, as a function of the
time-horizon $\tau$. As a manifestation of the Epps effect [11], this increases with $\tau$
in a way which is reasonably well approximated by a logarithmic growth. The ratio of the
second largest eigenvalue $\lambda$ to the largest, which could be taken as a measure of
the relative strength of inter-asset correlations against global correlations, has a
declining trend with $\tau$ for small time-horizons and then saturates at around 0.1.

FIG. 1: Distribution of correlation coefficients $A_{i,j}$ and $B_{i,j}$ for different
time-horizons $\tau$ (top) and at the intraday time-horizon for different datasets (NYSE
data).

FIG. 2: Largest eigenvalue $\Lambda/N$, divided by the number of assets, of the matrix
$A^{(\tau)}$ as a function of $\tau$ for NYSE, LSE and PB (full symbols). Ratio
$\lambda/\Lambda$ of the second largest to the largest eigenvalue of $A^{(\tau)}$, as a
function of $\tau$ (open symbols).
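
The two quantities plotted in Fig. 2 can be computed from any correlation matrix. The Python sketch below uses power iteration with deflation, chosen here only to keep the example self-contained; the block-structured test matrix (within-sector correlation 0.8, cross-sector correlation 0.6) and the function names are hypothetical. For that matrix the two largest eigenvalues are known exactly, 4.4 and 0.8.

```python
import math

def leading_eigenpair(m, n_iter=500):
    """Largest eigenvalue and eigenvector of a symmetric positive semi-definite matrix."""
    n = len(m)
    v = [1.0 + 0.01 * i for i in range(n)]   # generic starting vector
    for _ in range(n_iter):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))
    return lam, v

def top_two_eigenvalues(c):
    """Largest and second largest eigenvalues, the latter obtained by deflation."""
    lam1, v1 = leading_eigenpair(c)
    n = len(c)
    deflated = [[c[i][j] - lam1 * v1[i] * v1[j] for j in range(n)] for i in range(n)]
    lam2, _ = leading_eigenpair(deflated)
    return lam1, lam2

# Two sectors of three stocks each.
labels = [0, 0, 0, 1, 1, 1]
C = [[1.0 if i == j else (0.8 if labels[i] == labels[j] else 0.6)
      for j in range(6)] for i in range(6)]
lam1, lam2 = top_two_eigenvalues(C)
share = lam1 / len(C)   # share of correlation carried by the market mode
ratio = lam2 / lam1     # strength of inter-asset relative to global correlations
```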



III. DATA CLUSTERING

We performed data clustering analysis following the method of Ref. [6]. Here we only sketch
the basic idea of the method and we refer the interested reader to Ref. [6] for details. In
brief, assume we wish to cluster $N$ standardized [23] time series $x_i(t)$ in groups
having a similar dynamics. First we assign a cluster label $s_i$ to each time series,
specifying which cluster it belongs to. Then we assume that $x_i(t)$ is generated according
to the model

$$x_i(t) = \frac{\sqrt{g_{s_i}}\, \eta_{s_i}(t) + \epsilon_i(t)}{\sqrt{1 + g_{s_i}}},$$

where $\eta_s(t)$ and $\epsilon_i(t)$ are independent gaussian variables with mean zero and
unitary variance. Here $\eta_s(t)$ describes the component of the dynamics which is common
to all time series $x_i(t)$ with $s_i = s$, whereas $\epsilon_i(t)$ describes idiosyncratic
fluctuations. This model is consistent with a correlation matrix
$X_{i,j} = \langle x_i x_j \rangle$ which has a block diagonal structure for
$T \to \infty$: $X_{i,j} = g_{s_i}/(1 + g_{s_i})$ if $s_i = s_j$ (with $i \neq j$) and
$X_{i,j} = 0$ otherwise. The parameters $g_s$ entering the model as well as the cluster
structure $\{s_i\}$ can be determined by maximum likelihood estimation. Approximate
maximization of the log-likelihood can be done following a hierarchical clustering
procedure [24]: start with $N$ clusters, each composed of a single asset
($s_i^{(N)} = i$). From the configuration $\{s_i^{(K+1)}\}$ with $K + 1$ clusters, compute
the log-likelihood of all configurations obtained by merging two clusters. The
configuration $\{s_i^{(K)}\}$ with $K$ clusters is the one corresponding to the maximal
log-likelihood $\mathcal{L}_K$. This operation can be iterated with $K$ going from $N - 1$
to 1, and the optimal configuration can be chosen as that for which $\mathcal{L}_K$ is
maximal. This also predicts the optimal number $K^*$ of clusters which describes our
dataset. This method has already been used to analyze stock market data, where the emergent
clusters were found to be highly correlated with economic activity. Furthermore the method
was extended to perform noise undressing. In Ref. [7] the method has been applied to
investigate market dynamics, showing that well defined recurrent states of market-wide
activity can be defined.

Here we apply this method to investigate how the structure of the market's correlations
evolves as the time lag $\tau$ increases from the high-frequency range to the daily scale.
We shall first focus on NYSE and then discuss the differences found in other markets.




FIG. 3: Top: Number of clusters for datasets A,B,C,D and E. 
Bottom: number of clusters accounting for 90% of the likelihood 
(NYSE data). 
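
The merging procedure used in the data clustering analysis can be sketched in a few lines of Python. The sketch relies on an assumption that is not spelled out in the present text: the log-likelihood of a cluster of $n_s$ series with internal correlation $c_s = \sum_{i,j \in s} X_{i,j}$ is taken to be $\frac{1}{2}\left[\ln(n_s/c_s) + (n_s - 1)\ln\frac{n_s^2 - n_s}{n_s^2 - c_s}\right]$ for $c_s > n_s$ and zero otherwise, which is the form we recall from the data clustering literature. All function names and the toy correlation matrix are hypothetical.

```python
import math

def cluster_loglik(c, members):
    """Log-likelihood contribution of one cluster (assumed form, see lead-in)."""
    n = len(members)
    cs = sum(c[i][j] for i in members for j in members)   # internal correlation
    if n < 2 or cs <= n:
        return 0.0
    cs = min(cs, n * n - 1e-12)   # guard against perfectly correlated clusters
    return 0.5 * (math.log(n / cs) + (n - 1) * math.log((n * n - n) / (n * n - cs)))

def merge_clustering(c):
    """Greedy merging from N single-asset clusters down to one cluster.

    At each step the pair whose merger gives the highest total log-likelihood
    is merged; the configuration with the highest log-likelihood is returned.
    """
    clusters = [[i] for i in range(len(c))]
    best_ll, best = 0.0, [list(cl) for cl in clusters]
    while len(clusters) > 1:
        total = sum(cluster_loglik(c, cl) for cl in clusters)
        top = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                ll = (total
                      - cluster_loglik(c, clusters[p])
                      - cluster_loglik(c, clusters[q])
                      + cluster_loglik(c, clusters[p] + clusters[q]))
                if top is None or ll > top[0]:
                    top = (ll, p, q)
        ll, p, q = top
        merged = clusters[p] + clusters[q]
        clusters = [cl for k, cl in enumerate(clusters) if k not in (p, q)] + [merged]
        if ll > best_ll:
            best_ll, best = ll, [sorted(cl) for cl in clusters]
    return best_ll, best

# Toy correlation matrix: two groups of three series, correlation 0.6 inside
# a group and 0 between groups.
groups = [0, 0, 0, 1, 1, 1]
X = [[1.0 if i == j else (0.6 if groups[i] == groups[j] else 0.0)
      for j in range(6)] for i in range(6)]
best_ll, best = merge_clustering(X)
```

On this toy matrix the procedure recovers the two groups and stops short of merging them, since the log-likelihood of a single cluster of six series is lower than that of the two-cluster configuration.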



A. NYSE 

Fig. 3 shows the evolution of the number of clusters with the time-horizon $\tau$ for the
different datasets in the NYSE. For $A$ we find fewer clusters than for the other datasets,
and the number of clusters increases with $\tau$. This is consistent with the results of
Ref. [12], which observes an evolution of the structure of correlations where more and more
details are added as the time-horizon increases. The other datasets, however, reveal that
this is due to the fact that $A$ includes the correlations induced by the common factor.
When this is removed, as for $B$, $D$ and $E$, we find that the number of clusters which
accounts for most of the log-likelihood is remarkably stable from the 5 min to the intraday
scale. When the S&P500 index is removed from the data ($C$), we find a fast evolution of
the structure between 5 min and 30 min, after which the number of clusters saturates to a
constant level. Again, in all cases, a significant variation takes place in the overnight
and hence at the daily (cl-cl) scale.

FIG. 4: Evolution of the cluster structure with time-horizon for the set $A$ (top left),
$B$ (top right), $C$ (bottom left) and $E$ (bottom right) of NYSE. The cluster label
$s_i^{(\tau)}$ of each asset belonging to the most relevant clusters is shown as a function
of $\tau$. In this way, assets that always belong to the same cluster follow the same
"trajectory" (trajectories of different assets $i$ are shifted by a small random variable
to distinguish them). The color is relative to the cluster structure at the intraday scale.

A closer view of the evolution of the cluster structure is presented in Fig. 4. This plots
the cluster label $s_i^{(\tau)}$ as a function of $\tau$, for each asset belonging to
clusters accounting for 90% of the log-likelihood [25]. Hence assets $i$ and $j$ belonging
to the same cluster for all $\tau$ follow parallel trajectories in the figure. In this
representation, cluster splitting and merging can clearly be read off. In datasets $A$ and
$C$ we see considerable splitting of clusters as we move from $\tau = 5$ min to the daily
time-horizons. A substantial reshuffling and merging takes place when going to overnight
returns. On the contrary, in datasets $B$, $E$ and $D$ (not shown), cluster membership
exhibits a remarkable stability at intraday scales: the vast majority of assets within a
cluster at 5 min follow the same "trajectory" across time-horizons. Some reshuffling takes
place in the order of clusters, suggesting that the structure of correlations among sectors
evolves with the time-horizon. Again, the structure of overnight returns is considerably
different.

TABLE II: Overlaps $\Im$ between cluster structures at different time-horizons and
different sets and the structure of set $A$ at $\tau = 1$ day.

 A   11  42  77  86  100  89  24
 B   91  90  91  90  90  90  72
 C   33  66  84  86  87  92  89
 D   91  90  92  91  89  92  90  78
 E   91  87  90  87  90  90  90  80

In order to make the comparison of different cluster structures quantitative, we have
introduced an information distance $\Im(s^{(1)}, s^{(2)})$ between any two structures
$\{s_i^{(1)}\}$ and $\{s_i^{(2)}\}$. In words, this tells us how much the knowledge of the
cluster label $s_i^{(1)}$ of a randomly chosen stock $i$ yields information on the value of
$s_i^{(2)}$. Information is quantified by entropy reduction, in the following manner: let
$p^{(q)}(s)$ be the fraction of stocks with $s_i^{(q)} = s$ for $q = 1, 2$ and
$p^{(1|2)}(s|s')$ be the fraction of stocks with $s_i^{(1)} = s$ among those which have
$s_i^{(2)} = s'$. From these, we can compute the entropies $S^{(q)}$ in the usual way and
the conditional entropy

$$S^{(1|2)} = -\sum_{s'} p^{(2)}(s') \sum_{s} p^{(1|2)}(s|s') \log p^{(1|2)}(s|s').$$

The information gain is then given by

$$\Im(s^{(1)}, s^{(2)}) = \frac{S^{(1)} - S^{(1|2)}}{S^{(1)}}. \tag{9}$$

Because of the normalization, a value of $\Im \approx 1$ implies that $s^{(2)}$ yields
rather precise information on $s^{(1)}$, so if $\Im = 0.8$ we shall say that $s^{(2)}$
accounts for 80% of the information contained in $s^{(1)}$. Table II shows the values of
$\Im$ (in %) between different cluster structures and that obtained from set $A$ at the
$\tau = 1$ day time-horizon. This shows that at this time-horizon the cluster structure is
essentially the same in the five datasets, with an overlap larger than 90%. An overlap of
the same order of magnitude is found at all intraday scales in sets $B$, $D$ and $E$. Even
though the overlap drops as one moves to overnight returns, the difference is much smaller
in sets $B$, $D$ and $E$ than in sets $A$ and $C$. This suggests that, even though
overnight returns have a structure which is markedly different from that of intraday
returns, removing the market mode still allows one to reveal more invariant features.
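
The information gain between two cluster structures can be computed directly from the label frequencies. The Python sketch below assumes the normalization by the entropy $S^{(1)}$, in line with the statement that the gain expresses the fraction of the information contained in $s^{(1)}$; function names and the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of the empirical distribution of labels."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())

def conditional_entropy(s1, s2):
    """S(1|2) = - sum_{s'} p2(s') sum_s p(s|s') log p(s|s')."""
    n = len(s1)
    joint = Counter(zip(s1, s2))
    marginal2 = Counter(s2)
    # p2(s') * p(s|s') equals the joint frequency k / n
    return -sum((k / n) * math.log(k / marginal2[b]) for (_, b), k in joint.items())

def information_gain(s1, s2):
    """Fraction of the entropy of s1 removed by the knowledge of s2.

    Assumes that s1 contains at least two distinct labels, so that its
    entropy is non-zero.
    """
    h1 = entropy(s1)
    return (h1 - conditional_entropy(s1, s2)) / h1

# Toy example: six stocks in three sectors, and a coarser two-cluster structure.
sectors = [0, 0, 1, 1, 2, 2]
clusters = [0, 0, 0, 0, 1, 1]
gain = information_gain(sectors, clusters)
```

The gain equals 1 when the second structure is a relabeling of the first, and 0 when the second structure puts all stocks in a single cluster.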

Such invariant features, we claim, are related to economic sectors. In order to support
this, we compare the cluster structures with the classification of assets in the sectors of
economic activity given in Table I. The latter yields a sector label
$e_i \in \{1, \dots, 12\}$ for each stock $i$, for which we can compute an information gain
$\Im$, as above, setting $s_i^{(1)} = e_i$. Fig. 5 shows the behavior of $\Im$ for
different datasets across time-horizons. This suggests that the most informative sets are
those where the market mode is removed and that these account for 80% of the information
contained in $e_i$. For these, the information content is remarkably constant across
time-horizons. On the contrary, for set $A$ the information gain $\Im$ increases with
$\tau$ in the intraday range, as if information on the economic activity of assets were
"released" gradually as the time-horizon increases. It is worth remarking that, for all
datasets, overnight returns (especially for sets $A$ and $C$) carry much less information
on the economic structure of the market than intraday returns.

Hence, we conclude that in datasets $A$ and $C$ the evolution in the cluster structure is
due to the interplay between the "center of mass" motion (i.e. the market mode) and the
internal dynamics. Indeed when the former contribution is subtracted from the data, as in
datasets $B$, $D$ and $E$, we find that the structure of correlations is remarkably stable
with the time-horizon. This is consistent with a notion of informational efficiency of the
market by which information is incorporated very quickly in the market's returns. From the
above analysis, we infer that the information on the relations between assets is
efficiently incorporated in returns over time-horizons shorter than 5 min in NYSE.




FIG. 5: Information gain on the classification in economic sectors given by the knowledge
of cluster structures $s^{(\tau)}$ at different time-horizons $\tau$, for different
datasets $A, \dots, E$. In order to avoid effects due to differences in the number of
clusters, we considered maximum likelihood structures with 20 clusters for all datasets.
Notice that by normalization $0 \le \Im \le 1$.



B. Other markets 

We have performed data clustering analysis also on LSE and PB data. Again we found that
removing the market mode allows one to reveal the structure of correlations much more
clearly. Indeed, while set $A$ is characterized by one or two clusters at intraday time
scales, sets $B, \dots, E$ are characterized by a richer structure, as shown in Figs. 6 and
7. In both cases, we see that a significant part of the structure forms at intermediate
time-horizons of 15 to 30 minutes. In PB data, we pushed our analysis to ultra-high
frequency, probing very short time scales. We found that for $\tau < 5$ min barely any
structure can be seen in the correlation matrix. As for NYSE, we found that the cluster
structure of set $A$ is poorly correlated with the classification of assets in economic
sectors, whereas datasets $B$ and $E$ cluster in a way which reflects up to 70% of the
(entropy of a) classification in economic sectors for LSE, and that this information
content is roughly constant across (intraday) time scales.

As for the NYSE, we found that overnight returns have a cluster structure which is markedly
different from that of intraday returns. Different markets, however, exhibit different
patterns in this respect. While the LSE has a fragmented cluster structure of overnight
returns similar to NYSE, PB shows a more compact structure.




FIG. 6: Evolution of the cluster structure for set $E$ of LSE.

FIG. 7: Evolution of the cluster structure for set $B$ of 75 stocks in PB.

In contrast with our findings on NYSE data, the cluster 
structure of set A is now markedly different from that of other 
sets even at the daily scale. This suggests that the role of 
global correlation is much stronger in LSE and PB. 

In order to compare different markets, we performed the Kolmogorov-Smirnov (KS) test [19]
on the distribution of cluster sizes. This provides a $p$ value for the hypothesis that two
different samples $\{s^{(1)}\}$ and $\{s^{(2)}\}$ of cluster sizes can be considered as
drawn from the same unknown parent distribution. If this is not the case (i.e. $p$ is
small), we can conclude that the two samples have a different structure, whereas if $p$ is
close to one, we cannot reject the hypothesis that the two samples have the same structure.
We found that LSE and PB have a cluster size distribution which is different from that of
NYSE ($p \approx 0.1$), but that the two are remarkably similar to each other
($p \approx 1$).

The similarity between LSE and PB, and their difference with NYSE, is also visible in the
dependence of the largest eigenvalues on $\tau$ shown in Fig. 2. Remarkably, the market
mode seems stronger in NYSE than in LSE and PB, whereas data clustering suggests the
opposite.

In the case of PB data, we also performed several tests in order to assess the sensitivity
of our results to the inhomogeneity of trading activity. One may indeed think that
particular times of the day, such as the opening or the closure of the market, peak or
lunch break hours, might be characterized by different statistical properties. In order to
test for these effects, we removed the first and the last 20 minutes of trading from the
data in each day and considered the resulting correlation matrices $A', B', \dots$. We
computed the relative information $\Im$ between the maximum likelihood structures obtained
in this way and the original ones, at different time scales $\tau$. The result is that, for
set $B$ of PB, at all $\tau$ roughly $\Im \approx 70\%$ of the structure found in the whole
dataset coincides with that obtained by eliminating the opening and the closing period (see
Fig. 8). An even stronger similarity ($\Im = 0.83$) was found in NYSE between the structure
of intraday correlations and those obtained from returns measured roughly 30 minutes after
opening and before closing. We conclude that a significant part of the structure is not
affected by the activity at the market opening or at closure.

As a further test to check the effects of the time inhomogeneity of trading activity, we
computed correlation matrices in tick time for PB, over intervals of
$\tau_k^{(\mathrm{tick})} = 100 \cdot 2^k$ ticks, which correspond on average to the time
scales used in real time (here a tick is defined as a transaction on any of the stocks
considered). The results, shown in Fig. 8, suggest that the structure of market correlation
is largely independent of the definition of time, as indeed roughly 80% of the information
found with real time is recovered using tick time.

FIG. 8: Relative entropy $\Im$ of the cluster structures of set $B$ of PB obtained for i)
tick and real time (circles) on maximum likelihood structures (filled) or structures with
20 clusters (open) and ii) with and without the opening and closure period of roughly 30'
(filled squares).



IV. SINGLE LINKAGE CLUSTERING ANALYSIS 

In this section we review the results obtained by applying
the Single Linkage Clustering Algorithm (SLCA) to the data
considered in the previous sections. For each time-horizon
considered, the SLCA yields a Hierarchical Tree (HT) and a
Minimum Spanning Tree (MST), which give complementary
information about the network structure of the considered set
of stocks. Indeed, the HT gives a description of the hierarchical
organization of the stocks, while the MST gives an indication
about their topological organization. For a review of
SLCA in the context of multivariate financial time series we
refer to Refs. [..., 23, ...].
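As a rough illustration of how an HT and an MST can be read off the same computation, the following Python sketch runs Kruskal's algorithm on the distance $d_{ij} = \sqrt{2(1-\rho_{ij})}$. The choice of this distance, the toy two-sector data and the function names are assumptions made for the example and are not taken from the analysis reported here. The sketch relies on the known fact that the MST edges, sorted by increasing length, coincide with the merges of single linkage clustering.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mst_single_linkage(returns):
    """Kruskal's algorithm on d_ij = sqrt(2 (1 - rho_ij)) (assumed distance).

    Returns the MST edges as (d, i, j) in the order in which they are
    accepted. Read in that order, each edge is also a single-linkage
    merge: at height d the clusters containing i and j are joined.
    """
    n = len(returns)
    edges = sorted(
        (math.sqrt(max(0.0, 2.0 * (1.0 - pearson(returns[i], returns[j])))), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # the edge joins two distinct clusters
            parent[ri] = rj
            tree.append((d, i, j))
    return tree

# Hypothetical toy data: two "sectors" of three stocks each,
# every sector driven by its own common factor plus idiosyncratic noise.
random.seed(0)
T = 500
factors = [[random.gauss(0, 1) for _ in range(T)] for _ in range(2)]
returns = [
    [factors[k // 3][t] + 0.5 * random.gauss(0, 1) for t in range(T)]
    for k in range(6)
]

tree = mst_single_linkage(returns)
for d, i, j in tree:
    print(f"merge {i}-{j} at distance {d:.3f}")
```

In this toy case the four shortest links join stocks of the same sector and only the last, much longer link connects the two sectors, which is the kind of pattern the HT and MST figures are meant to display.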

As in the previous section, here we apply the
SLCA to the different datasets in order to investigate how the
structure of the market's correlations evolves as the time-horizon
$\tau$ increases from intraday scales to the daily scale. We shall
first focus on NYSE and then discuss the differences found in
other markets. The colors used in the representation of both
the HTs and the MSTs refer to the classification in sectors of
economic activity given in Table I.



A. NYSE 

The investigation of NYSE data by using the SLCA reveals
that the role of the "center of mass" in the structure of the
correlations is twofold. On the one hand, the level of clustering in all
the HTs in the sets where the "center of mass" is removed is at
a higher distance than in the corresponding HTs of set A. Such an
effect is expected since, by removing the "center of mass", the
mean correlation is now approximately zero, as shown in Fig.
1. On the other hand, the cluster structure seems now to be
more evident than in the case of the original data.
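The statement that the mean correlation becomes approximately zero once the "center of mass" is removed can be checked on synthetic data. The Python sketch below assumes the simplest definition of the market mode, an equal-weight average of the returns at each time step (as in set E), and a hypothetical one-factor toy market; names and parameters are illustrative only.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_offdiag_corr(returns):
    """Average of the correlation coefficients over all pairs i < j."""
    n = len(returns)
    vals = [
        pearson(returns[i], returns[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(vals) / len(vals)

def remove_center_of_mass(returns):
    """Subtract the equal-weight cross-sectional mean at each time step."""
    n, T = len(returns), len(returns[0])
    market = [sum(r[t] for r in returns) / n for t in range(T)]
    return [[r[t] - market[t] for t in range(T)] for r in returns]

# Hypothetical toy market: one common factor plus independent noise.
random.seed(1)
N, T = 20, 400
common = [random.gauss(0, 1) for _ in range(T)]
returns = [[common[t] + random.gauss(0, 1) for t in range(T)] for _ in range(N)]

residuals = remove_center_of_mass(returns)
before = mean_offdiag_corr(returns)
after = mean_offdiag_corr(residuals)
print(f"mean correlation before: {before:.3f}, after: {after:.3f}")
```

For N stocks with comparable variances the residual mean correlation is of order -1/(N-1), so it is small and slightly negative rather than exactly zero.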

In Fig. 9 we present the data for set A (top) and set E
(bottom) at the two extreme time-horizons of 5 min (left) and
1 day (right). Contrary to what we find in set A (top left),
the HT of set E at the 5 min time-horizon (bottom left) shows a
significant level of structure that, additionally, is similar to the
one found at the 1 day (op-cl) time-horizon (bottom right).

This is also confirmed by comparing the structure of the
MSTs in set A and set E. The 5 min MST of set A shows a
typical structure with a few hubs characterized by a high
degree (Fig. 10). The 1 day (op-cl) MST of set A indicates that
the number of hubs has increased, reflecting the progressive
organization of stocks according to their sectors of activity as
the time-horizon increases (Fig. 11). The MSTs shown in Figs.
12 and 13 for set E are markedly different from the
corresponding ones for set A. No prominent hub is traceable in the
two MSTs. In addition, their structures are remarkably
similar to one another, to the extent that one could not say
which is which on the basis of their statistical structure alone.

In order to quantify the difference between the structure of
the MSTs of different datasets at different time-horizons, we
performed the Kolmogorov-Smirnov (KS) test [19] on the
degree distributions of the MSTs. The results for different sets are
collected in Table III and largely confirm the conclusions
based on visual inspection of Figs. 10-13. First, we see
that the structures of the MSTs at the extremes of the intraday
scale range are markedly different in set A and become
increasingly similar as we move to set E. Second, Table III
shows that the structure of set A is similar to that of other sets
at the same time-horizon at $\tau = 1$ day (op-cl), but this is not
true at smaller time-horizons.

We also compared the MSTs with random MSTs (r-MSTs)
generated by uncorrelated random walks of the same length.
This reveals that, apart from set A, we are not able to detect
any statistical feature in the degree distribution which
differentiates the MSTs of sets B, C, D and E at $\tau = 1$ day (op-cl)
from those generated by pure noise. Even the diameter of the
MSTs is not able to discriminate them from those generated
by pure noise. However, the similarity of MSTs with r-MSTs
disappears for larger datasets of $N = 500$ or $N = 2000$ stocks
of NYSE, for which the KS test yields values of $p \simeq 0$ for all
sets, at both $\tau = 5$ min and 1 day (op-cl). Furthermore, MSTs turn
out to be considerably more compact than r-MSTs. For example,
we find that with $N = 500$ the r-MSTs have a diameter of 53
whereas at $\tau = 1$ day (op-cl) the largest value of the diameter
is 37 for set E. Finally, in the case of $N = 500$ stocks, for
set B, set C, set D and set E we have also performed the KS
test in order to compare the degree distributions of the MSTs
at 5 min and 1 day (op-cl). Such tests confirm the result of
Table III, valid for $N = 100$ stocks, that the degree
distributions are essentially indistinguishable, with p-values which
are close to one. Hence, we conclude that the removal of the
"market mode" generates residues whose MSTs still contain
non-trivial statistical features, although these are not clearly
observable in the case of $N = 100$ assets. When considering
a larger set, say $N = 500$, the noise threshold lowers enough
to reveal a topological organization which is different from the
one associated to uncorrelated random walks.

FIG. 9: HT for set A (top) and E (bottom) of NYSE at $\tau = 5$ min
(left) and at daily (op-cl) time-horizon (right). The vertical lines
represent different stocks. For each stock, colors refer to its
economic sector of activity, see Table I. Economic sectors of activity are
defined according to the classification scheme used in the web-site
http://finance.yahoo.com/.

TABLE III: Results of the Kolmogorov-Smirnov test on the
degree distribution of MSTs for different datasets and time-horizons
in NYSE. The first row compares the MSTs at $\tau = 5$ min and $\tau = 1$
day (op-cl) in different datasets $X = A, \ldots, E$. The second (third)
row compares the structure of the MST of set A at $\tau = 5$ min (1
day) with the MSTs in different datasets at the same horizon $\tau$. The
fourth (fifth) row compares the MSTs of sets $A, \ldots, E$ at $\tau = 5$ min
(1 day) with one generated by a random sample of $N = 100$ random
walks of the same length. The last two rows report the diameters
of the MSTs at $\tau = 5$ min and 1 day (op-cl). These should be
compared with the diameter $22 \pm 3$ of an r-MST generated by
uncorrelated random walks.

FIG. 10: MST for set A of NYSE at $\tau = 5$ min. The vertices
represent different stocks. For each stock, colors refer to its
economic sector of activity, see Table I. Economic sectors of activity are
defined according to the classification scheme used in the web-site
http://finance.yahoo.com/.
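The two diagnostics used to compare MSTs, the KS test on degree sequences and the tree diameter, can be sketched in a few lines of Python. The p-value below uses the standard asymptotic Kolmogorov series, which is only approximate for discrete data such as node degrees. The two trees (one dominated by a hub, one a simple chain) are hypothetical examples, not MSTs from the datasets studied here.

```python
import math
from collections import deque

def degrees(edges, n):
    """Degree of each of the n vertices of a graph given as an edge list."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

def ks_two_sample(a, b):
    """Two-sample KS statistic D and asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    d, i, j = 0.0, 0, 0
    for x in sorted(set(a) | set(b)):
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))   # distance between empirical CDFs at x
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.3:                        # the series equals 1 to within 1e-5 here
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))

def diameter(edges, n):
    """Diameter (number of links) of a tree, via two breadth-first searches."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    def farthest(src):
        dist = {src: 0}
        queue = deque([src])
        last = src
        while queue:
            last = queue.popleft()       # last vertex dequeued is the farthest
            for v in adj[last]:
                if v not in dist:
                    dist[v] = dist[last] + 1
                    queue.append(v)
        return last, dist[last]

    end, _ = farthest(0)
    return farthest(end)[1]

# Hypothetical trees on 100 vertices.
n = 100
hub_tree = [(0, k) for k in range(1, 50)] + [(k, k + 1) for k in range(49, 99)]
path_tree = [(k, k + 1) for k in range(99)]
relabelled_path = [((7 * k) % n, (7 * (k + 1)) % n) for k in range(99)]

d_hub, p_hub = ks_two_sample(degrees(hub_tree, n), degrees(path_tree, n))
d_same, p_same = ks_two_sample(degrees(path_tree, n), degrees(relabelled_path, n))
print(f"hub vs chain:   D = {d_hub:.2f}, p = {p_hub:.2e}")
print(f"chain vs chain: D = {d_same:.2f}, p = {p_same:.2f}")
print("diameters:", diameter(hub_tree, n), diameter(path_tree, n))
```

A p-value close to one means the two degree sequences cannot be told apart, whereas a p-value close to zero signals different degree distributions. The test is blind to which stock sits on which vertex, so two trees with identical degree sequences but different labelling are reported as identical.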



In Fig. 14 we show the HTs relative to set A (left) and set
E (right) in the case when the overnight time-horizon is
considered. The structure of such trees is different from the ones
at intraday time-horizons. In particular, for set A, the HT of
Fig. 14 shows that some stocks are highly correlated with
each other. However, the organization in economic sectors of
activity is less marked than in the corresponding HT at the daily
time-horizon, see Fig. 9. Such an effect is also observable when
considering set E, i.e. the right panel of Fig. 14. Here the
average level of correlation increases, as expected. It is therefore
evident that at the overnight time-horizon the organization of
stocks in clusters is different than at intraday time-horizons,
i.e. when the market is open.

FIG. 11: MST for set A of NYSE at $\tau = 1$ day (op-cl). The vertices
represent different stocks. For each stock, colors refer to its
economic sector of activity, see Table I. Economic sectors of activity are
defined according to the classification scheme used in the web-site
http://finance.yahoo.com/.

FIG. 12: MST for set E of NYSE at $\tau = 5$ min. The vertices
represent different stocks. For each stock, colors refer to its
economic sector of activity, see Table I. Economic sectors of activity are
defined according to the classification scheme used in the web-site
http://finance.yahoo.com/.

FIG. 13: MST for set E of NYSE at $\tau = 1$ day (op-cl). The vertices
represent different stocks. For each stock, colors refer to its
economic sector of activity, see Table I. Economic sectors of activity are
defined according to the classification scheme used in the web-site
http://finance.yahoo.com/.

We have seen above that when removing the market mode
the topology of the MSTs has no specific statistical features,
even though the distribution of stocks on them is definitely not
random. Indeed the cluster structure seen in the HTs (Fig. 9)
corresponds to the fact that companies belonging to the same
economic sector appear clustered in the same region of the MST.
Again, this shows that the removal of the "center of mass"
reveals the organization in sectors of activity already at
time-horizons as small as 5 min. It is worth remarking, though,
that the location of the sectors themselves along the tree is
different at 5 min and at the daily scale. In other words, the
inter-sector structure evolves in time, while the sector
composition remains stable.

FIG. 14: HT for set A (left) and E (right) of NYSE at overnight
time-horizon. The vertical lines represent different stocks. For each stock,
colors refer to its economic sector of activity, see Table I. Economic
sectors of activity are defined according to the classification scheme
used in the web-site http://finance.yahoo.com/.

In order to give a quantitative description of this effect, for
each set and at each time-horizon we have measured the
fraction of the MST links that are conserved with respect to the
open-to-close case. The results are reported in Fig. 15. The
top panel refers to the case when all links in the MST are
considered. The other two panels refer to the case when we also
use the information about the economic sectors of activity, see
Table I. In particular, we consider only intra-sector links
(middle panel) or only inter-sector links (bottom panel). Ideally,
for a better quantitative description we should have
considered clusters rather than economic sectors. Unfortunately, the
SLCA does not allow a precise identification of what a cluster
is. However, Fig. 5 shows that there exists a strict
relation between economic sectors and the clusters obtained by
using the methodology of Ref. [6]. We here make
the ansatz that such a strict relation persists also in the
clusterization given by the SLCA.

In the case when we consider all links (top) or only those
between stocks in the same sector (middle), in all the cases
but one the fraction of conserved links is higher when the
center of mass has been removed than for set A. The middle
panel of Fig. 15 shows that 70-80% of the MST links
between stocks belonging to the same economic sector are
conserved with respect to the open-to-close case, whereas a much
smaller fraction is conserved between stocks belonging to
different economic sectors. This is consistent with our
observation that while sector composition remains stable, inter-sector
correlations evolve with the time-horizon. Moreover, such
results are also consistent with the ones shown in Fig. 5, that the
amount of economic information contained in the clusters is
constant.
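The fraction of conserved links can be made concrete with a short Python sketch. It assumes that the fraction is normalized by the number of links of the chosen kind in the intraday MST and that an MST is stored as a list of unordered vertex pairs. The six-stock trees and the sector labels are hypothetical.

```python
def normalise(edges):
    """Represent undirected links as sorted tuples so (i, j) equals (j, i)."""
    return {tuple(sorted(e)) for e in edges}

def conserved_fraction(mst_tau, mst_ref, sector, kind="all"):
    """Fraction of links of mst_tau that are also present in mst_ref.

    kind = "all"   : every link of mst_tau is considered
    kind = "intra" : only links joining two stocks of the same sector
    kind = "inter" : only links joining stocks of different sectors
    Returns None when mst_tau has no link of the requested kind.
    """
    if kind not in ("all", "intra", "inter"):
        raise ValueError("kind must be 'all', 'intra' or 'inter'")
    ref = normalise(mst_ref)
    selected = [
        e for e in normalise(mst_tau)
        if kind == "all"
        or (kind == "intra") == (sector[e[0]] == sector[e[1]])
    ]
    if not selected:
        return None
    return sum(e in ref for e in selected) / len(selected)

# Hypothetical example: six stocks, two sectors.
sector = {0: "tech", 1: "tech", 2: "tech", 3: "energy", 4: "energy", 5: "energy"}
mst_5min = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
mst_open_to_close = [(1, 0), (1, 2), (1, 4), (3, 4), (4, 5)]

frac_all = conserved_fraction(mst_5min, mst_open_to_close, sector, "all")
frac_intra = conserved_fraction(mst_5min, mst_open_to_close, sector, "intra")
frac_inter = conserved_fraction(mst_5min, mst_open_to_close, sector, "inter")
print(frac_all, frac_intra, frac_inter)
```

In this toy case every intra-sector link survives while the single inter-sector link is rewired. This is an extreme version of the pattern described above, where intra-sector links are largely conserved and inter-sector links are not.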

In this respect, the bottom panel of Fig. 15 shows that set
D and set E reveal better than the others the topological
organization within different economic sectors at all time-horizons.
Finally, it is worth remarking that set C, where the market
mode is exogenously given by the SP500 index, gives results
which are comparable with those of set A.

In summary, the investigation of sets A, B, C, D and
E by using the SLCA shows that (i) the removal of the "center
of mass" reveals the organization of the stocks within
different economic sectors even at small time-horizons and (ii) this
is better achieved in set D and set E, where the "center of
mass" is endogenously obtained either by minimizing the
function of Eq. (11) or by using a mere return market average.
Finally, we find that the degree distributions of the MSTs at
different time-horizons are statistically the same, especially in
set E, according to the Kolmogorov-Smirnov test, but they
cannot be distinguished from those of a set of $N$ independent
random walks, for such a small market ($N = 100$). The
distribution of stocks on the MST reflects the organization of stocks
in economic sectors, and indeed links between companies in
the same sector are "conserved" across time scales.



B. Other markets 

The question arises whether the above results have some
degree of universality or whether they are peculiar to the NYSE
market. We have therefore repeated the above investigations for
different markets, i.e. for LSE, PB and BI. Generally we
confirmed the main conclusions: we find that the HTs of sets B, C, D
and E reveal the organization of stocks in economic
sectors better than set A, and that the structure of the HTs for the
former is less dependent on the time-horizon $\tau$ than for the latter.
The structure of the MSTs has a clear evolution in set A as the
time-horizon increases (e.g. the KS test yields $p_{\rm LSE} = 0.051$ for
the degree distributions of the MSTs of set A between $\tau = 5$ min
and 1 day), whereas it is remarkably stable in the
other sets (particularly for set E, for which $p_{\rm LSE} = 0.999$
between $\tau = 5$ min and 1 day). A comparison of the MSTs for
set E for NYSE and LSE yields a KS test value of $p > 0.9$
for all time-horizons $\tau$. Similar results were found comparing
NYSE and PB or BI MSTs. This invariance of the structure of
MSTs for set E across markets and time-horizons should not
be considered as an indication of universality, though. Indeed,
as for NYSE, this invariant structure is indistinguishable from
that of r-MSTs generated from uncorrelated random walks.
Hence, what this allows us to conclude is that markets of such
small sizes do not allow one to make statements on the similarity
of market topology in terms of their MSTs. Indeed, the
topology of MSTs for $N \approx 100$ stocks or less is dominated by
noise.

FIG. 15: Fraction of the intraday MST links that are conserved with
respect to the open-to-close case in the NYSE data. We report the
cases where we consider all the links (top), only intra-sector links
(middle) or only inter-sector links (bottom). Economic sectors of
activity are defined according to the classification scheme used in
the web-site http://finance.yahoo.com/.

When the market is open, the disposition of stocks on the
MSTs, as in NYSE, is consistent with the economic
classification, across time-horizons. In Fig. 16 we report, for different
sets and at each time-horizon, the fraction of the MST links
that are conserved with respect to the open-to-close case for
LSE (left), PB (middle) and BI (right). Again, we
consider all the links (top), only intra-sector links (middle) or
only inter-sector links (bottom). As in the NYSE
case, the sectors considered here are the economic sectors of
activity mentioned above. In the case of LSE data the results
are less sharp than in the NYSE case. Set B, set D and set
E give results which are more similar to each other than
in the NYSE case. One possible exception is given by
set B at the 5 min time-horizon. In all cases, it is confirmed that
the removal of the "market mode" reveals the organization of
stocks in economic sectors already at small time-horizons. As
an example, the fraction of conserved intra-sector links in set
E always ranges between 50% and 60%, while in set A
such percentage drops to 30% at the smallest time-horizon.
At larger time scales, however, the fraction of conserved links
for set A has roughly the same value as for the other sets. This
is different from what we found for NYSE, where the fraction
of conserved links was systematically smaller for set A than
for other sets.

FIG. 16: Fraction of the intraday MST links that are conserved
with respect to the open-to-close case in the LSE (left), PB
(middle) and BI (right) data. We report the cases where we
consider all the links (top), only intra-sector links (middle) or only
inter-sector links (bottom). Economic sectors of activity are
defined according to the classification scheme used in the web-site
http://www.euroland.com/.

When considering the overnight time-horizon, we confirm 
that the organization of stocks in economic sectors of activ- 
ity is less evident than in the case when the market is open. 
However, such differences are less marked than in the NYSE 
case. 



V. CONCLUSIONS 

We found that removing the dynamics of the center of mass
i) decreases the level of correlations and ii) makes the cluster
structure more evident. Naively one would expect that
reducing the level of correlations reduces the "signal" and hence
enhances the role of noise in the dataset. On this ground, one
might expect a less sharply defined structure, i.e. the opposite
of ii). The fact that we observe i) and ii) implies that the
market mode dynamics bears little or no information on the
market structure. It also suggests that the market mode dynamics
and the dynamics of "internal coordinates" are to a large
extent separable, in much the same manner as in particle systems
of classical mechanics, where the center of mass dynamics
accounts for the effect of external forces, whereas relative
coordinates respond to internal forces arising from inter-particle
potentials.

It is not difficult to imagine components of trading
activity which might contribute to the dynamics of the "center of
mass" or to relative coordinates. It is worth remarking, in this
respect, that a simple phenomenological model for the
dynamics of the market mode, taking into account the impact of
trading in risk minimization strategies, has been recently proposed
[22]. Besides reproducing the main statistical properties of the
dynamics of the largest eigenvalue of the covariance matrix,
this model also shows that the behavior of the market mode is
largely insensitive to a finer structure of correlations. The
invariance of the structure of "internal" correlations across time
scales, and its similarity with economic classification, instead
suggests that the dynamics of relative coordinates might be
related to the ways in which information on different assets
diffuses in the market.

The finding of a scale-invariant correlation structure is
non-trivial, in several respects. First, its origin suggests a fine
balance between signal and noise across time-horizons: On the one
hand, the growth of correlations implicit in the Epps effect
implies that the "signal" gets stronger as the time scale
increases. On the other, random matrix theory suggests that the
strength of "noise" due to finite sampling is more severe at
large time-horizons than at short ones. Indeed, the length of
the time series decreases as $T \sim 1/\tau$, which implies a spread
$\delta\lambda \sim \sqrt{N/T} \sim \sqrt{\tau}$ in the eigenvalues due to noise dressing.
This latter effect allows us to detect weak correlation
structures with a high precision at small time scales.
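The scaling of the noise band with the time-horizon can be made explicit with the Marchenko-Pastur result for the correlation matrix of $N$ uncorrelated series of length $T$. The steps below are a sketch under that idealized assumption (unit variances, $N$ and $T$ large with $N/T < 1$). The symbol $T_{\rm tot}$, the total time span covered by the data, is introduced here only for illustration.

```latex
% Assumption: N uncorrelated unit-variance series, N and T large, N/T < 1.
% T_tot (total time span of the sample) is a symbol introduced for this sketch.
\begin{align}
\lambda_\pm &= \left(1 \pm \sqrt{N/T}\right)^2
  && \text{(edges of the noise band)} \\
\delta\lambda &\equiv \lambda_+ - \lambda_- = 4\sqrt{N/T} \\
T &= T_{\rm tot}/\tau
  \;\Longrightarrow\;
  \delta\lambda = 4\sqrt{N\tau/T_{\rm tot}} \propto \sqrt{\tau}
\end{align}
```

For instance, at fixed $N$ and $T_{\rm tot}$, a 64-fold increase of $\tau$ widens the noise band by a factor of 8.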

Secondly, the scale invariance of correlation structure might 
have important implications for risk management, because it 
suggests that correlations on short time scales might be used 
as a proxy for correlations on longer time-horizons. If the 
structure of correlations at short time scales can be computed 
using shorter time series, this might allow us to detect
structural changes more efficiently.

Finally, uncovering the dynamical origin of such a complex 
phenomenology poses exciting challenges to theoretical
modeling of multi-asset markets.